Abstract

This report consists of two papers describing networks for parallel
matrix operations. In the first paper, "Area-time optimal VLSI networks
for multiplying matrices," we describe a class of VLSI networks having
chip area A for multiplying two $n \times n$ matrices in time T, with an
area x time$^2$ product $AT^2 = O(n^4)$. These networks achieve Savage's
lower bound to this complexity measure for any T such that
$\log n \le T \le n$.

The second paper, "Optimal integrated-circuit implementation of triangular
matrix inversion," describes a class of VLSI implementations of algorithms
for inverting an $n \times n$ triangular matrix. These networks have area A and
time T, with an area x time$^2$ product $AT^2 = O(n^4)$ for all values of T such
that $O(\log^2 n) \le T \le O(n)$. Since there is a simple reduction of matrix
multiplication to inversion of a triangular matrix, due to Savage's result
the presented networks are optimal in the VLSI model.

1. Introduction


Savage [1] has shown that, under reasonable assumptions reflecting
current VLSI technology [2,3,4], the designer of any circuit for multiplying
two $n \times n$ matrices is confronted with a tradeoff between chip area A and
computation time T expressed by $AT^2 = \Omega(n^4)$. Designs of Kung-Leiserson
[5] meet that bound for $T = O(n)$ and minimal area $A = O(n^2)$.

The purpose of this note is to illustrate a general design scheme of
VLSI networks for multiplying two $n \times n$ matrices. According to this scheme,
a network with optimal value of the statistic $AT^2$ can be designed for any
value of T (see footnote 1) in the range $[\log n, n]$ (see footnote 2). The
scheme makes use of a recursively defined matrix multiplier, to be described
next, and implements pipelining in an efficient way.


2.  A  Straightforward  Matrix  Multiplier 


Let $U = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and
$V = \begin{pmatrix} e & f \\ g & h \end{pmatrix}$ be two $s \times s$ matrices. We present a
network for computing the matrix product $U \times V$ in a recursive fashion,
assuming that we know how to construct circuits for multiplying two
$\frac{s}{2} \times \frac{s}{2}$ matrices, such as a, b, c, d, e, f, g, h.

The layout for such a rectangular network is shown in figure 1, where one
finds 8 recursively defined multipliers; lines drawn represent bundles of
$s^2/4$ wires, corresponding to the parallel transmission of full
$\frac{s}{2} \times \frac{s}{2}$ submatrices; the network comprises 4 matrix adders, which are
placed in the $\frac{s^2}{4} \times \frac{s^2}{4}$ area occupied by the intersection of two
bundles of wires as shown in figure 2.
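The recursion described above can be sketched in software. The following is only an illustration, not the layout itself: it assumes the block arrangement U = [[a, b], [c, d]] and V = [[e, f], [g, h]], uses plain Python integers in place of elements of a finite ring, and all function names are hypothetical.

```python
def add(X, Y):
    """Entrywise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def split(M):
    """Split an s x s matrix (s even) into four s/2 x s/2 blocks."""
    h = len(M) // 2
    top, bottom = M[:h], M[h:]
    return ([row[:h] for row in top], [row[h:] for row in top],
            [row[:h] for row in bottom], [row[h:] for row in bottom])

def join(P, Q, R, S):
    """Reassemble four blocks into the matrix [[P, Q], [R, S]]."""
    return [p + q for p, q in zip(P, Q)] + [r + s for r, s in zip(R, S)]

def multiply(U, V):
    """Recursive product U x V; the size s is assumed to be a power of two."""
    if len(U) == 1:                       # elementary ring multiplier
        return [[U[0][0] * V[0][0]]]
    a, b, c, d = split(U)
    e, f, g, h = split(V)
    # 8 recursive multipliers feeding 4 matrix adders
    return join(add(multiply(a, e), multiply(b, g)),
                add(multiply(a, f), multiply(b, h)),
                add(multiply(c, e), multiply(d, g)),
                add(multiply(c, f), multiply(d, h)))

U = [[1, 2], [3, 4]]
V = [[5, 6], [7, 8]]
product = multiply(U, V)   # [[19, 22], [43, 50]]
```

Each call performs eight half-size products and four block additions, mirroring the 8 multipliers and 4 adders of the network.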


1) Computation time T is expressed in units, which also reflect the
current state of technology.

2) All logarithms in this paper are to the base 2.

For specificity, we assume the matrix elements to be drawn from a
finite ring, so that an elementary finite chip can be used for multiplying
and adding elements in that ring in constant time and area. Our recursive
definition stops at $s = 1$, where we use the elementary circuit for ring
multiplication.

The width w(s) of our rectangular network satisfies the recurrence

$$w(s) = 6\,\frac{s^2}{4}\,\lambda + 3\,w\!\left(\frac{s}{2}\right), \qquad w(1) = w_1,$$

where $\lambda$ is the wire width as in [2] or [4], and $w_1$ the width of the
elementary multiplier. (We assume elementary adders to have width and
height $\lambda$.) The solution to this recurrence is (where we have set $s = 2^p$):

$$w(2^p) = 6\,(4^p - 3^p)\,\lambda + 3^p w_1, \quad\text{thus}\quad w(s) = O(s^2)\cdot\lambda + O(s^{\log 3}).$$

Similarly, the height h(s) of our circuit satisfies
$h(s) = 11\,\frac{s^2}{4}\,\lambda + 3\,h(\frac{s}{2})$, $h(1) = h_1$, where $h_1$ is the height of the
elementary multiplier. The exact solution is given by
$h(2^p) = 11\,(4^p - 3^p)\,\lambda + 3^p h_1$; thus $h(s) = O(s^2)$ and
the total chip area is $A = h \times w = O(s^4)$.
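As a quick sanity check, the closed-form solutions can be compared with the recurrences numerically. The sketch below assumes the recurrences read w(s) = 6 (s/2)^2 λ + 3 w(s/2) and h(s) = 11 (s/2)^2 λ + 3 h(s/2); the function names and the unit values chosen for λ, w_1 and h_1 are illustrative only.

```python
def dimension(p, coeff, lam=1, base=1):
    """Unroll d(2**p) = coeff * (2**p / 2)**2 * lam + 3 * d(2**(p-1)), d(1) = base."""
    if p == 0:
        return base
    half = 2 ** (p - 1)
    return coeff * half ** 2 * lam + 3 * dimension(p - 1, coeff, lam, base)

def closed_form(p, coeff, lam=1, base=1):
    """Claimed solution coeff * (4**p - 3**p) * lam + 3**p * base."""
    return coeff * (4 ** p - 3 ** p) * lam + 3 ** p * base

# coeff = 6 gives the width w, coeff = 11 gives the height h
widths = [dimension(p, 6) for p in range(8)]
heights = [dimension(p, 11) for p in range(8)]
```

For every p the unrolled recurrence and the closed form agree, and both grow like 4^p = s^2, which is the O(s^2) behaviour used for the area estimate.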

Notice that the network consists of $2\lceil \log s \rceil + 1$ levels, so subdivided:
the first $\lceil \log s \rceil$ levels are buffer-drivers, whose only purpose is to
construct two copies of their input; next, there is a single level of
elementary multipliers, followed by $\lceil \log s \rceil$ levels of adders. Therefore
the computation time of the network just described is
$T(s) = \lceil \log s \rceil\, t_c + t_m + \lceil \log s \rceil\, t_a$, where $t_c$, $t_m$, and $t_a$ are respectively the
times for copying, multiplying, and adding, for a total of $T(s) = O(\log s)$.
Furthermore, it is important to point out that at any time before the end
of the computation all intermediate results are on one and the same level
of the matrix multiplier, which is therefore ideally suited for pipelined
operation. This straightforward scheme, when used for multiplying $n \times n$
matrices, has area $A = O(n^4)$ and time $T = O(\log n)$, and thus does not yet
meet the $AT^2 = \Omega(n^4)$ bound.


3.  Pipelining  Strategies 

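Consider now two $n \times n$ matrices A and B, and decompose each of them
into $r^2$ blocks of size $\frac{n}{r} \times \frac{n}{r}$. Specifically for A we obtain

$$A = \begin{pmatrix} A_{11} & \cdots & A_{1r} \\ \vdots & & \vdots \\ A_{r1} & \cdots & A_{rr} \end{pmatrix}$$

and similarly for B.

If we define the $n \times n$ matrices

$$C_j = \begin{pmatrix} A_{1j} \\ \vdots \\ A_{rj} \end{pmatrix} \times \begin{pmatrix} B_{j1} & \cdots & B_{jr} \end{pmatrix} \qquad (j = 1, 2, \ldots, r),$$

obviously $C = A \times B = C_1 + C_2 + \cdots + C_r$. Thus the product of A and B
can be obtained by "accumulating" the matrices $C_1, \ldots, C_r$. We now show
how the calculation of these matrices and their accumulation can be
pipelined.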
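The accumulation can be illustrated with a small sketch. It assumes that C_j is the product of the j-th block column of A with the j-th block row of B (the reading consistent with C = C_1 + ... + C_r); the helper names and the sample matrices are hypothetical.

```python
def matmul(X, Y):
    """Plain product of an m x k matrix by a k x q matrix."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def add(X, Y):
    """Entrywise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def partial_product(A, B, j, r):
    """C_j: block column j of A (n x n/r) times block row j of B (n/r x n)."""
    b = len(A) // r
    block_column = [row[j * b:(j + 1) * b] for row in A]
    block_row = B[j * b:(j + 1) * b]
    return matmul(block_column, block_row)

A = [[1, 2, 0, 1], [0, 1, 3, 2], [4, 0, 1, 1], [2, 2, 1, 0]]
B = [[1, 0, 2, 1], [3, 1, 0, 0], [0, 2, 1, 4], [1, 1, 0, 2]]
r = 2
total = [[0] * 4 for _ in range(4)]
for j in range(r):                     # accumulate C_1, ..., C_r
    total = add(total, partial_product(A, B, j, r))
```

After the loop, `total` equals the full product of A and B.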

The  scheme  is  a  classical  one-way  pipelining  application  (see  [6],  ch.  9). 
Consider  the  r  X  r  square  array  of  modules  shown  in  figure  3.  For  the  time 
being  we  assume  that  each  module  has  a  register  c  and  receives  two  operands 
a  and  b  (a  on  its  "west"  input  and  b  on  its  "north"  input);  it  performs 
the operation $c \leftarrow a \times b$, and passes on a to the "east" and b to the
"south". Here a, b, and c are $(n/r) \times (n/r)$ matrices.

Figure 3. General layout of the network.

To compute $C_j$, we feed $[A_{1j} \cdots A_{rj}]^T$ and $[B_{j1} \cdots B_{jr}]$ with the
timing as shown in figure 3 at uniform speed. At each step the front
of the moving data lies on a secondary diagonal of the array; the
modules on this diagonal compute a product and store it, and it is
obvious that after $(2r-1)$ such steps the matrix $C_j$ is stored in the
module registers. Since at each step only one diagonal of the array is
active, the pipelining can be naturally implemented. Specifically, A is
pipelined from left to right and B from top to bottom; each module is
now to be viewed as an inner product processor, which executes $c \leftarrow c + ab$,
thereby accumulating the matrix $C = A \times B$ (see figure 4).
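A small simulation may help to visualize the data flow. It is a sketch under simplifying assumptions: each operand is a scalar standing in for an (n/r) x (n/r) block, every module takes one step per operation, and row i and column k receive their inputs with a delay of i and k steps respectively (the skewed timing of figure 3). The function name and the step count 3r - 2 are properties of this sketch, not figures taken from the text.

```python
def systolic_multiply(A, B):
    """Simulate an r x r array of inner product modules executing c <- c + a*b."""
    r = len(A)
    a_reg = [[None] * r for _ in range(r)]   # operands travelling east
    b_reg = [[None] * r for _ in range(r)]   # operands travelling south
    c = [[0] * r for _ in range(r)]
    steps = 3 * r - 2                        # r wavefronts, each crossing 2r - 1 diagonals
    for t in range(steps):
        new_a = [[None] * r for _ in range(r)]
        new_b = [[None] * r for _ in range(r)]
        for i in range(r):
            for k in range(r):
                if k == 0:                   # west boundary: row i starts at step i
                    j = t - i
                    new_a[i][k] = A[i][j] if 0 <= j < r else None
                else:
                    new_a[i][k] = a_reg[i][k - 1]
                if i == 0:                   # north boundary: column k starts at step k
                    j = t - k
                    new_b[i][k] = B[j][k] if 0 <= j < r else None
                else:
                    new_b[i][k] = b_reg[i - 1][k]
        a_reg, b_reg = new_a, new_b
        for i in range(r):
            for k in range(r):
                if a_reg[i][k] is not None and b_reg[i][k] is not None:
                    c[i][k] += a_reg[i][k] * b_reg[i][k]
    return c, steps

result, steps = systolic_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Module (i, k) sees the pair A[i][j], B[j][k] at step i + k + j, so the operands of each wavefront j lie on a secondary diagonal, as described above.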
Figure 4. Pipelining scheme.

The structure of the inner product module is shown in figure 5. It
is easily realized that the module, which incorporates an $s \times s$ multiplier
of the type described in Section 2, with $s = n/r$, has width and
height both $O(s^2) = O(n^2/r^2)$.
The computation of the product matrix C is effected by a serial
pipelining both through the array and the multipliers; since the former
has r levels and the latter has $O(\log(n/r))$ levels, the matrix C is
available after time $T = O(r + \log(n/r))$, and can be shifted out in either
row- or column-major order (shifting of the output actually may begin
even before the last block of C has been computed).

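To evaluate the performance of the scheme, we note that the array
consists of a compact mesh of $r \times r$ modules, each of which has area
$O(n^4/r^4)$. Thus the network area is

$$A = r^2 \times O\!\left(\frac{n^4}{r^4}\right) = O\!\left(\frac{n^4}{r^2}\right).$$

As for the time T, we have just shown that

$$T = O\!\left(r + \log\frac{n}{r}\right).$$

Combining these two results we obtain

$$AT^2 = n^4 \times O\!\left(\left(1 + \frac{1}{r}\log\frac{n}{r}\right)^2\right).$$

For all values of T in the range $[\log n, n]$, the term
$O\!\left(\left(1 + \frac{1}{r}\log\frac{n}{r}\right)^2\right)$ is bounded by a constant, whence

$$AT^2 = O(n^4),$$

thereby substantiating our earlier claim.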
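The bound can be checked numerically. The sketch below drops all constant factors and assumes A = n^4 / r^2 and T = r + log(n/r) exactly; the function name and the sample values of n and r are illustrative. It suggests that the ratio AT^2 / n^4 stays below 4 once r is at least log n, and that it grows for smaller r.

```python
import math

def at2_ratio(n, r):
    """AT^2 / n^4 for A = n^4 / r^2 and T = r + log2(n / r), constants dropped."""
    area = n ** 4 / r ** 2
    time = r + math.log2(n / r)
    return area * time ** 2 / n ** 4

n = 2 ** 20                                  # log n = 20
ratios = {r: at2_ratio(n, r) for r in (20, 64, 1024, 2 ** 15, n)}
outside = at2_ratio(n, 1)                    # r = 1 is the unpipelined multiplier
```

With r = 1 the ratio is (1 + log n)^2 = 441, which corresponds to the straightforward multiplier of Section 2 missing the bound by a factor of order log^2 n.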

Remark. There is another pipelining strategy, apparently
unrelated to the preceding one, which also allows us to meet Savage's
bound, but for a smaller interval of computation times T. This strategy
is worth reporting.

Again, regard an $n \times n$ matrix as an $n/r \times n/r$ matrix whose elements are
$r \times r$ blocks. Using a single $n/r \times n/r$ matrix multiplier of the type
displayed in figure 1, we adopt the following pipelining scheme:

1) the $r^2$ elements of each block for both matrices A and B are fed
serially on one input line (there are $(n/r)^2$ such lines);

2) the elementary multiplier (see Section 2) must now have the
capability of performing an $r \times r$ matrix multiplication. Such an
elementary multiplier could be, for example, a hexagonal mesh of
the Kung-Leiserson [5] type, with area $O(r^2)$ and time $O(r)$. The
operands arrive serially, must be stored, then processed, and
the result finally must be released serially; obviously the
arrival and release pipelining times $O(r^2)$ predominate over the
multiplication time $O(r)$.

3) The result is an $n/r \times n/r$ matrix, whose elements are $r \times r$ blocks
which appear serially on each output line.

We evaluate the performance of the scheme. As long as $r^2 \ge O(\log n)$
we have for the computation time that $T = O(r^2)$. As to the area, we note
that the width $w^*$ of the network is given by

$$w^* = O\!\left(\left(\frac{n}{r}\right)^2\right)\lambda + O\!\left(\left(\frac{n}{r}\right)^{\log 3}\right) w_1,$$

where the width $w_1$ of the elementary multiplier is now $O(r)$. It follows
that, as long as $r < O\!\left(n^{(2-\log 3)/(3-\log 3)}\right)$, we have $w^* = O(n^2/r^2)$; a similar
result holds for the height $h^*$ of the network, whence $A = O(n^4/r^4)$. We
conclude that

$$AT^2 = O(n^4)$$

(i.e., it is optimal) for all values of T such that $O(\log n) \le T \le O(n^{0.58})$.
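The exponent 0.58 can be recovered with a short calculation, sketched here under the assumption that the limit on r is n raised to (2 - log 3)/(3 - log 3) and that T grows as r^2; the variable names are illustrative.

```python
import math

log3 = math.log2(3)                       # logarithms are to the base 2
r_exponent = (2 - log3) / (3 - log3)      # largest r is about n ** r_exponent
t_exponent = 2 * r_exponent               # T = O(r^2), so T is at most about n ** t_exponent
```

Here `r_exponent` is roughly 0.29 and `t_exponent` roughly 0.587, in line with the upper limit quoted for T.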
1. Introduction

Increasing attention has been paid recently to the design of networks
for the direct implementation of several interesting algorithms using the
integrated-circuit technology (VLSI); particularly, combinatorial and
numerical problems have been the target of these investigations [1-4].
Among numerical problems, several workers have directed their attention
to matrix computations [1,2,5], and, as regards the design of networks,
have found that the mesh interconnection of computing modules is
particularly attuned to this class of problems, leading to optimal
realizations [5,6] in the VLSI model [7,8].

In this paper we consider the problem of designing VLSI networks
for inverting a nonsingular triangular matrix. The design
complies with specifications of the VLSI model of computation
recently proposed by Mead, Conway, and Thompson [7,8]. In this model,
the network is a computation graph consisting of nodes (processing
modules) and wires. Wires have unit width and are partitionable into
two orthogonal sheaves. A data item takes a unit of time to propagate
along a wire from node to node (processing time is thus absorbed into
propagation time). The usual complexity metric is the area x time$^2$
product ($AT^2$), which embodies a trade-off between production cost (chip
area A) and incremental cost (time T).

Within this model, Savage [6] has recently proved the following
interesting result: any VLSI design for the multiplication of two $n \times n$
matrices, with chip area A and computation time T, must satisfy the
lower bound $AT^2 \ge C n^4$, for some constant C. In [5] the authors demonstrate
the existence of VLSI networks for multiplying $n \times n$ matrices with
$AT^2 = O(n^4)$ for any computation time T within the bounds $\log n \le T \le n$.

Note that an $AT^2 \ge C n^4$ bound also holds for the problem of inverting a
nonsingular $n \times n$ triangular matrix, since matrix multiplication is
reducible to it; the straightforward reduction is based on the fact that
the inverse of the $3n \times 3n$ triangular matrix

$$\begin{pmatrix} I & A & 0 \\ 0 & I & B \\ 0 & 0 & I \end{pmatrix} \quad\text{is}\quad \begin{pmatrix} I & -A & AB \\ 0 & I & -B \\ 0 & 0 & I \end{pmatrix},$$

i.e., it contains an $n \times n$ block equal to the product AB.
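The reduction is easy to verify on a small instance. The sketch below builds the 3n x 3n triangular matrix with blocks I, A, B and the claimed inverse with blocks I, -A, AB, -B, then multiplies them; integer entries and all helper names are assumptions made for the illustration.

```python
def matmul(X, Y):
    """Plain matrix product on lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def zeros(n):
    return [[0] * n for _ in range(n)]

def negate(X):
    return [[-x for x in row] for row in X]

def assemble(blocks):
    """Glue a grid of equally sized square blocks into one matrix."""
    return [sum((blk[i] for blk in block_row), [])
            for block_row in blocks for i in range(len(block_row[0]))]

def reduction_matrices(A, B):
    """Return the 3n x 3n triangular matrix and its claimed inverse."""
    n = len(A)
    I, Z = identity(n), zeros(n)
    M = assemble([[I, A, Z], [Z, I, B], [Z, Z, I]])
    M_inv = assemble([[I, negate(A), matmul(A, B)],
                      [Z, I, negate(B)],
                      [Z, Z, I]])
    return M, M_inv

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
M, M_inv = reduction_matrices(A, B)
top_right = [row[4:] for row in M_inv[:2]]   # the n x n block holding AB
```

Multiplying `M` by `M_inv` yields the 6 x 6 identity, and `top_right` is the product AB, so any inverter for triangular matrices of order 3n also multiplies matrices of order n.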

This paper is organized as follows. In Section 2 we review the
general scheme for inverting an $n \times n$ triangular matrix, and evaluate
two network implementations, corresponding respectively to
block-partitioning the matrix and choosing extreme values for the block size
in the allowable range. These two inverters are referred to as
"recursive" and "systolic" respectively; with respect to the $AT^2$
statistic only the latter is optimal, for $T = O(n)$. In Section 3 we
show that the recursive and systolic inverters can be combined to build
networks, called "mixed" inverters, which realize the $AT^2 = \Omega(n^4)$ lower
bound for all values of T such that $O(\log^2 n) \le T \le O(n)$.


2.  The  general  scheme  for  Inverting  a  nonsingular  triangular  matrix. 

Let A be a nonsingular $n \times n$ triangular matrix (see footnote 1), to be
thought of as an $n/s \times n/s$ matrix whose elements are $s \times s$ blocks of the
original entries (s is a parameter in the range $[1, n/2]$); let $A_{ij}$ be the
(i,j) block of A ($i, j = 1, 2, \ldots, n/s$) and let $A_{ij}^{(-1)}$ be the
corresponding block of $A^{-1}$. It is well known, and also straightforward
to verify, that

$$A_{ij}^{(-1)} = -\left[A_{ii}^{(-1)}, A_{i,i+1}^{(-1)}, \ldots, A_{i,j-1}^{(-1)}\right] \cdot \begin{pmatrix} A_{ij} \\ A_{i+1,j} \\ \vdots \\ A_{j-1,j} \end{pmatrix} \cdot A_{jj}^{-1} \qquad (i < j). \qquad (1)$$

This general formula will now be specialized to two interesting cases.
2.1  Recursive  inversion 

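The standard scheme for the parallel inversion of a triangular
matrix [9,10] corresponds to specializing the general scheme to $s = n/2$.
In this case the inverse of

$$\begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix} \quad\text{is}\quad \begin{pmatrix} A_{11}^{-1} & -A_{11}^{-1} A_{12} A_{22}^{-1} \\ 0 & A_{22}^{-1} \end{pmatrix}.$$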

This immediately suggests a recursively defined network, containing two
inverters of $n/2 \times n/2$ triangular matrices (to be used to compute $A_{11}^{-1}$
and $A_{22}^{-1}$ in parallel) and a network for the parallel multiplication of
two $n/2 \times n/2$ matrices (to be used to compute $(A_{11}^{-1} A_{12}) A_{22}^{-1}$ in the order
shown by the parenthesization). In figure 1 we show a possible layout
for such a network. Each line shown carries $n^2/4$ operands in parallel
and the shaded surfaces are buffers of capacity $n^2/4$; the core of the circuit
consists of two multipliers of two $(n/2) \times (n/2)$ matrices, of a type described
in [5], and called recursive multipliers.

1) The entries of all matrices considered in this paper are assumed to
be drawn from a finite ring, so that an elementary finite chip can
be used for multiplying and adding entries in constant area and time.

Figure 1. Layout of the recursive matrix inverter;
shaded boxes are data buffers.

Each of these multipliers has height and
width bounded respectively by $(6/4)n^2$ and $(11/4)n^2$ and computes a
matrix product in $2\log n - 1$ time units. Due to the recursive definition
of the inverter, a simple argument shows that its height and width are
respectively bounded by $(15/4)n^2$ and $(15/3)n^2$; also, the computation time
is $O((\log n)^2)$. Notice therefore that for the matrix inverter being
described, called recursive inverter, we have the following properties
(referred to an $n \times n$ matrix):

Type                 A          T              AT^2
Recursive inverter   $O(n^4)$   $O(\log^2 n)$  $O(n^4 \log^4 n)$      (3)

Note that $AT^2$ is short of the optimal value $\Omega(n^4)$.
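The recursive scheme translates directly into a short program. This is a sketch only: it assumes an upper triangular input, uses exact rational arithmetic (`fractions.Fraction`) instead of a finite ring, and the function names are hypothetical.

```python
from fractions import Fraction

def matmul(X, Y):
    """Plain matrix product on lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def invert_upper(A):
    """Invert a nonsingular upper triangular matrix by halving (the case s = n/2)."""
    n = len(A)
    if n == 1:
        return [[1 / Fraction(A[0][0])]]
    h = n // 2
    A11 = [row[:h] for row in A[:h]]
    A12 = [row[h:] for row in A[:h]]
    A22 = [row[h:] for row in A[h:]]
    X11 = invert_upper(A11)                   # the two half-size inverters are
    X22 = invert_upper(A22)                   # independent (parallel in hardware)
    product = matmul(matmul(X11, A12), X22)   # (A11^-1 A12) A22^-1, as parenthesized
    X12 = [[-x for x in row] for row in product]
    top = [left + right for left, right in zip(X11, X12)]
    bottom = [[Fraction(0)] * h + row for row in X22]
    return top + bottom

A = [[2, 1, 0, 3],
     [0, 1, 4, 1],
     [0, 0, 2, 5],
     [0, 0, 0, 1]]
A_inv = invert_upper(A)
```

The recursion depth is log n and each level is followed by two matrix products, which is the source of the O(log^2 n) computation time when the products themselves take logarithmic time.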


2.2 Systolic inversion


The next scheme to be described corresponds to the choice $s = 1$ in
the general method. The resulting network is a mesh of processors, each
of which feeds data in and out, each time performing some computation,
keeping a regular flow in the network. Such networks have been called
systolic by Kung and Leiserson [1].

With our choice of s, block $A_{ik}$ in (1) becomes entry $a_{ik}$ (and similarly
$A_{ik}^{(-1)}$ becomes $a_{ik}^{(-1)}$). The form of (1) suggests a computation method on
an $n \times n$ square mesh (figure 2). Only the upper-triangular positions
in this mesh need contain processing modules (i.e., denoting by $M_{ij}$
the module in position (i,j), $M_{ij}$ is deployed only for $j \ge i$).
Modules are of two types with different computational capabilities: D-modules
and M-modules, placed respectively in diagonal and off-diagonal positions.
Entry $a_{ij}^{(-1)}$ of $A^{-1}$ will be computed in place in $M_{ij}$. For $i = j$
(diagonal entry), the corresponding D-module must invert entry $a_{ii}$,
i.e., $a_{ii}^{(-1)} = 1/a_{ii}$; for $j > i$, the process is more complex and is to
be analyzed.

Figure 2. General structure of the systolic matrix inverter
(triangular mesh), with D-modules on the diagonal and M-modules elsewhere.

Each operation (inversion of an entry or multiplication of two
entries) is conventionally assumed to take one "step". Assume
inductively that for fixed t and $2k - h < t$, the computation of $a_{hk}^{(-1)}$ be
completed in $M_{hk}$ at the end of the $(2k-h)$-th step; the basis for this
induction is easily established for $h = k$, in which case we just activate
$M_{hh}$ at the h-th step. We now extend the induction by showing that $a_{ij}^{(-1)}$,
with $2j - i = t$, can be computed at the end of the t-th step. Indeed,
$a_{ip}^{(-1)}$ ($p < j$) is computed, by the inductive hypothesis, at the end of
the $(2p-i)$-th step; after this computation is completed, $a_{ip}^{(-1)}$ is
shifted to the right along the i-th row of the mesh (figure 2) (one
position per step), so that $a_{ip}^{(-1)}$ resides in $M_{ij}$ at the end of the
$[(2p-i) + (j-p)]$-th step. Similarly entry $a_{pj}$ ($p < j$) is shifted upwards
along the j-th column of the mesh of figure 2 (one position per time
step) starting at step $(j+1)$, so that $a_{pj}$ resides in $M_{ij}$ at step
$(p-i+j)$, simultaneously with $a_{ip}^{(-1)}$. Thus, the product $a_{ip}^{(-1)} \cdot a_{pj}$ can
be computed in the step $(p-i+j)$, and accumulated in $M_{ij}$. This shows
that the inner product $[a_{ii}^{(-1)}, \ldots, a_{i,j-1}^{(-1)}] \cdot [a_{ij}, \ldots, a_{j-1,j}]$ resides in
$M_{ij}$ at the end of step $(j-1) - i + j = 2j - i - 1$. Now, recall that, by the
inductive hypothesis, $a_{jj}^{(-1)}$ is computed in $M_{jj}$ at step j; therefore
it can be shifted upwards in the mesh and after $(j-i)$ steps (i.e., at
step $t = 2j - i$) it arrives in $M_{ij}$, where the computation of $a_{ij}^{(-1)}$ is
completed at the end of the t-th step, as was originally claimed.

For clarity, in figure 3(a) we illustrate the timing of the
computations: each module is labelled with an integer which denotes the step
at which computation in that module is completed. Also, in figure 3(b and
c) we present snapshots of the data participating in the horizontal and
vertical flow, respectively, at step 7. Clearly the calculation of $A^{-1}$
is completed in $2n-1$ steps.
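The completion schedule can be replayed in software. The sketch below is not a cycle-accurate model of the mesh: it only assumes that entry (i, j) of the inverse (1-indexed, upper triangular) is finished at step 2j - i and checks that every operand it needs was finished at an earlier step. Rational arithmetic and the function name are assumptions made for the illustration.

```python
from fractions import Fraction

def scheduled_inverse(a):
    """Compute the inverse entry by entry, in the order of the steps t = 2j - i."""
    n = len(a)
    inv = [[Fraction(0)] * n for _ in range(n)]
    done = {}                                  # (i, j) -> step at which the entry is ready
    for t in range(1, 2 * n):                  # steps 1, ..., 2n - 1
        for j in range(1, n + 1):
            i = 2 * j - t
            if not 1 <= i <= j:
                continue
            if i == j:                         # D-module: invert the diagonal entry
                value = 1 / Fraction(a[i - 1][i - 1])
            else:                              # M-module: inner product, then final step
                operands = [(i, p) for p in range(i, j)] + [(j, j)]
                if any(done[key] >= t for key in operands):
                    raise RuntimeError("operand not ready in time")
                inner = sum(inv[i - 1][p - 1] * a[p - 1][j - 1] for p in range(i, j))
                value = -inner * inv[j - 1][j - 1]
            inv[i - 1][j - 1] = value
            done[(i, j)] = t
    return inv, done

a = [[2, 1, 3],
     [0, 4, 5],
     [0, 0, 1]]
inverse, done = scheduled_inverse(a)
last_step = max(done.values())                 # 2n - 1 = 5 for n = 3
```

The entry in the upper right corner is the last one to be completed, at step 2n - 1.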

Figure 3. (a): timing of completion of computation up to step 7.
(b): data (x) participating in horizontal flow at step 7.
(c): data (x) participating in vertical flow at step 7.

As regards implementation details, initially module $M_{ij}$ is loaded with
$a_{ij}$. While $M_{ii}$ ($i = 1, \ldots, n$) must simply compute $1/a_{ii}$, an M-module
$M_{ij}$ ($i \ne j$) can be designed with one operand register
R and two buffers H and V; there are two operand input lines W and S, and
two operand output lines, E and N (see figure 4). Buffers H and V
constantly feed output lines E and N, respectively, and the module must
be capable of executing the following instructions:

1. $R \leftarrow R + WS$, $H \leftarrow W$, $V \leftarrow S$, for the general step.

2. $R \leftarrow -R \cdot S$, $H \leftarrow -R \cdot S$, $V \leftarrow S$, for the final step.

It is readily realized that this is all that is needed to implement the
computation described above.


According to our original assumption that both the area of the
processing modules and the time needed to execute any of the prescribed
operations be bounded by a constant, we have the following:

Type                A          T        AT^2
Systolic inverter   $O(n^2)$   $O(n)$   $O(n^4)$      (4)

i.e., the network is optimal for the $AT^2$ measure. The optimal
behavior, however, is achieved only for $T = O(n)$. An interesting question
is whether it can be extended to a wider range of processing times.
This question is addressed in the next section.
3. Mixed networks.

We now describe how to combine the recursive and systolic inverters
described in the preceding section in order to improve the $AT^2$ measure for
a wide range of the time parameter T.

The  resulting  networks  -  to  be  called  mixed  -  have  the  following 
general  structure.  A  mixed  network  is  a  systolic  scheme,  as  the  one 
described  in  2.2,  where  the  "operands",  rather  than  being  elementary  entries, 
are  blocks  of  s  X  s  such  entries.  In  the  corresponding  n/s  x  n/s  triangular 
mesh  (see  figure  2),  the  modules  must  now  be  designed  to  process  s  x  s 
blocks.  The  layout  of  mixed  networks  is  chosen  as  in  figure  3,  where  the 
modules  themselves  have  been  conveniently  assumed  to  have  a  rectangular 
shape  on  the  chip  (else,  we  consider  the  smallest  rectangle  with  sides 
parallel  to  the  coordinate  axes  which  contains  the  module).  From  figure  3, 
it  is  clear  that  while  the  dimensions  (width  and  height)  of  the  M-module 
determine  one  dimension  of  the  network  -  say,  its  width  -,  the  other 
dimension  -  say,  its  height  -  is  determined  by  the  larger  of  the 
corresponding  values  for  the  D-  and  M-modules . 


Figure 3. General layout of mixed networks. Each line
shown carries in parallel $s^2$ operands.
The first kind of mixed networks to be considered is one for which
the following selections are made (Type-1 mixed inverter):

(1) D-modules are recursive inverters, as described in Section 2.1;
they have width $5s^2$, height $(15/4)s^2$, and computation time
$O(\log^2 s)$.

(2) M-modules are recursive matrix multipliers of the
Preparata-Vuillemin type [5], as already used to build the recursive
inverter. They can be placed on the chip so that their width
and height are $2s^2$ and $(15/4)s^2$ (see [5] for details); their
computation time is $O(\log s)$.

Since the heights of the D- and M-modules are of the same order (in this
case, they are identical) the height of the network is
$O(\frac{n}{s} \times s^2) = O(ns)$; since also the widths of the two modules are
$O(s^2)$, the same holds for the width of the network. Thus, $A = O(n^2 s^2)$,
and the smallest containing rectangle is nearly a square with both sides
$O(ns)$. As regards computation time, the blocks $A_{ii}^{-1}$ ($i = 1, \ldots, n/s$)
are all computed in time $O(\log^2 s)$, and after this, the mesh computation
begins. We have observed in Section 2.2 that the systolic network
completes its computation in $O(n/s)$ steps, whence the total computation
time is $T = O(\log^2 s + \frac{n}{s}\log s)$. If we bound the parameter s by
$s \le n/\log n$ we obtain $T = O(\frac{n}{s}\log s)$, and the performance
of Type-1 networks is summarized as follows:


Type                    A              T                         AT^2
Type-1 mixed inverter   $O(n^2 s^2)$   $O(\frac{n}{s}\log s)$    $O(n^4 \log^2 s)$      (5)

(const. $\le s \le n/\log n$)
The second kind of mixed networks (Type-2 mixed inverter) is
constructed as follows:

(1) D-modules are type-1 mixed inverters (for $s \times s$ matrices).
According to the preceding discussion, for any value of a
parameter $r \le s/\log s$, these modules have height and width
both $O(sr)$ and computation time $O(\log^2 r + \frac{s}{r}\log r)$. Choosing
$r = s/\log s$ we obtain height $O(s^2/\log s)$, width $O(s^2/\log s)$, and
time $O(\log^2 s)$.

(2) M-modules are pipelined matrix multipliers, as introduced by
Preparata and Vuillemin in [5]. It is shown in [5] that one such
multiplier can be designed with height and width both $O(s^2/\log s)$
and computation time $O(\log s)$.

Again, the dimensions of both D-modules and M-modules are $O(s^2/\log s)$,
whence:

$$A = O\!\left(\left(\frac{n}{s}\right)^2 \cdot \frac{s^4}{\log^2 s}\right) = O\!\left(\frac{n^2 s^2}{\log^2 s}\right).$$


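With respect to computation time, we obtain the same conclusions as for
type-1 mixed inverters, i.e.,

$$T = O\!\left(\log^2 s + \frac{n}{s}\log s\right).$$

Therefore we obtain

$$AT^2 = O\!\left(\frac{n^2 s^2}{\log^2 s}\left(\log^2 s + \frac{n}{s}\log s\right)^2\right)
= O\!\left(\frac{n^2 s^2}{\log^2 s}\cdot\frac{n^2}{s^2}\log^2 s\left(1 + \frac{s}{n}\log s\right)^2\right)
= O\!\left(n^4\left(1 + \frac{s}{n}\log s\right)^2\right).$$

Obviously, if $s \le n/\log n$ we have $s\log s < n$, whence $AT^2 = O(n^4)$, and the
performance of the Type-2 mixed inverter is so summarized:

Type                    A                        T                                    AT^2
Type-2 mixed inverter   $O(n^2 s^2/\log^2 s)$    $O(\log^2 s + \frac{n}{s}\log s)$    $O(n^4)$

(const. $\le s \le n/\log n$)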


Since as s varies from a small constant value to $n/\log n$ the computation
time T varies from $O(n)$ to $O(\log^2 n)$, we say that the above network meets
the $AT^2 = O(n^4)$ optimal bound for all T such that $O(\log^2 n) \le T \le O(n)$.
Incidentally, even in totally unrestricted models of computation, such as
the shared-memory machine (see, for example, [10]), $O(\log^2 n)$ is the
smallest known running time for inverting a triangular matrix.
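The trade-off can be tabulated numerically. The sketch below drops all constant factors and assumes A = (n s / log s)^2 and T = log^2 s + (n/s) log s for the Type-2 mixed inverter; the function name and the sample values of n and s are illustrative.

```python
import math

def type2_metrics(n, s):
    """Area, time and AT^2 / n^4 of a Type-2 mixed inverter, constants dropped."""
    log_s = math.log2(s)
    area = (n * s / log_s) ** 2
    time = log_s ** 2 + (n / s) * log_s
    return area, time, area * time ** 2 / n ** 4

n = 2 ** 20                                   # n / log n is about 52428
samples = {s: type2_metrics(n, s) for s in (2, 16, 1024, 2 ** 15)}
ratios = [ratio for (_, _, ratio) in samples.values()]
```

Under these assumptions the ratio AT^2 / n^4 equals (1 + (s/n) log s)^2, so it stays between 1 and 4 over the whole range of s up to n / log n, while T falls from about n/2 at s = 2 to a few hundred steps at s = 2^15.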